[Feature](routine_load) auto fill missing columns#64071
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
531f6aa to
43168c2
Compare
|
/review |
1 similar comment
|
/review |
There was a problem hiding this comment.
I found blocking issues that need changes before this can be approved.
Critical checkpoints:
- Goal/test proof: the PR appears to add
fill_missing_columnssupport for JSON routine load and tests cover property parsing/show/alter validation, but the implementation currently does not compile and the Nereids routine-load execution path does not propagate the property intoJsonFileFormatProperties. - Scope/focus: the intended code change is small and focused, but the duplicate method and missing propagation make it incomplete.
- Concurrency/lifecycle/config/storage compatibility: no new concurrency, lifecycle, persisted format, or dynamic config concerns identified in this diff.
- Parallel paths: the routine-load job properties path and the Nereids file-format-property path are parallel; the latter is not updated, so runtime behavior diverges from show/alter state.
- Tests: added unit tests do not cover the Nereids execution path where
fill_missing_columnsmust be observed, and the current source would fail compilation before tests run. - Observability/performance/transactionality: no additional concerns found beyond the functional blockers.
User focus: no additional user-provided focus points were specified.
| /** | ||
| * Returns true if the file format is JSON and fill_missing_columns is enabled. Only meaningful for JSON. | ||
| */ | ||
| private boolean isFillMissingColumns(NereidsBrokerFileGroup fileGroup) { |
There was a problem hiding this comment.
This duplicates the isFillMissingColumns(NereidsBrokerFileGroup) method declared just above at lines 449-452, so NereidsLoadScanProvider will not compile (method isFillMissingColumns(...) is already defined). Please remove one of the duplicate declarations before this can pass FE compilation.
| // So that the following process can be unified | ||
| boolean specifyFileFieldNames = copiedColumnExprs.stream().anyMatch(p -> p.isColumn()); | ||
| if (!specifyFileFieldNames) { | ||
| if (!specifyFileFieldNames || isFillMissingColumns(fileGroup)) { |
There was a problem hiding this comment.
For routine load this condition will still be false even when the job property was set. KafkaRoutineLoadJob.toNereidsRoutineLoadTaskInfo() copies jobProperties into NereidsRoutineLoadTaskInfo, but NereidsDataDescription(NereidsLoadTaskInfo) only copies JSON props like strip_outer_array, jsonpaths, json_root, fuzzy_parse, read_json_by_line, and num_as_string into analysisMap. It never copies fill_missing_columns, so analyzeFileFormatProperties() builds a JsonFileFormatProperties with the default false, and this new branch is not taken for the routine-load execution path. Please add the property to the Nereids task-info/data-description propagation path and cover it with a test that reaches NereidsLoadScanProvider.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
For routine load jobs, if we have thousands of columns to import, we can use json format let doris automatically match the column name with table schema. But for some scenarios, the upstream data does not have the column name(can be derived from current columns) and we can not add new column name easily, so we consider if we can just specify the new derived column only and let doris fill the missing columns with table schema.
e.g.
The MQ data have
stat_time_dayfield but does not havep_datecolumn, we can add new column like thisp_date = stat_time_day, fill missing columns from table schema.Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)